Multivariate Time Series Modeling

Theoretical Framework

This chapter examines bidirectional relationships between variables, including the explicit linkage between basketball performance and betting industry valuations (Model 5: DKNG ~ Attendance + ORtg). By modeling how NBA metrics influence betting stocks, we test whether the on-court analytics revolution created measurable financial value in connected industries.

Understanding NBA offensive efficiency requires a framework that separates team performance from game tempo. Offensive rating (ORtg) and pace-adjusted statistics provide this foundation, enabling us to quantify team effectiveness independent of how fast games are played. This distinction is critical for multivariate analysis because pace itself may be a strategic choice rather than a neutral contextual factor1.

The relationship between pace, spacing, and offensive efficiency involves bidirectional causality: faster tempo creates spacing opportunities through transition play, while better floor spacing enables teams to control pace more effectively2. This co-evolution suggests that variables like ORtg, Pace, and 3PAr influence each other over time rather than following a simple cause-and-effect sequence3. Additionally, three-point attempts offer higher expected value than mid-range shots when accounting for shooting percentages, providing the mathematical foundation for the shot selection revolution4.
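To make the expected-value claim concrete with illustrative (assumed) percentages, a league-average three at 36% beats even a good mid-range two at 42%:

\[ \mathbb{E}[\text{3PT}] = 3 \times 0.36 = 1.08 \text{ points} \quad > \quad \mathbb{E}[\text{mid-range 2PT}] = 2 \times 0.42 = 0.84 \text{ points} \]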

Recent work reveals that teams with higher True Shooting percentage subsequently increased their three-point volume, suggesting reverse causation: success breeds strategy changes. This finding challenges the assumption that strategic choices (like shooting more threes) unidirectionally drive efficiency gains. Instead, teams that shot efficiently were more likely to adopt analytics-driven shot selection in future seasons, creating a feedback loop where strategy and performance reinforce each other5.

These dynamics motivate our multivariate approach. ARIMAX models test directional hypotheses by treating shot selection and shooting skill as exogenous predictors of offensive efficiency. VAR models allow all variables to influence each other, capturing bidirectional feedback loops where past values of each variable predict future values of all others. Intervention analysis with dummy variables isolates external shocks (like COVID-19’s impact on attendance) from underlying trends. Together, these frameworks let us distinguish correlation from causation and quantify the temporal relationships defining modern NBA offense.
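Schematically, the two frameworks differ in what they treat as given (a sketch, with \(x_t\) collecting the exogenous regressors and \(\mathbf{y}_t\) the vector of jointly modeled series):

\[ \text{ARIMAX:} \quad y_t = \beta_0 + \boldsymbol{\beta}' x_t + N_t, \quad N_t \sim \text{ARMA}(p, q) \]

\[ \text{VAR}(p): \quad \mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + \cdots + A_p \mathbf{y}_{t-p} + \boldsymbol{\epsilon}_t \]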

Note: Models 4 and 5 still need to be completed for the final portfolio.

Model 1: Efficiency Drivers (VAR)

ORtg ~ Pace & 3PAr

Model 2: Shot Selection & Efficiency (ARIMAX)

Response: ORtg ~ 3PAr + TS%

Model 3: COVID Impact on Attendance (ARIMAX)

Attendance ~ ORtg + Pace

Model 4: Pace Dynamics (VAR)

Pace ~ 3PAr + eFG%

Model 5: Sports Betting & NBA Recovery (VAR)

Variables: DKNG ~ Attendance + ORtg


Code
library(tidyverse)
library(ggplot2)
library(forecast)
library(astsa)
library(xts)
library(tseries)
library(fpp2)
library(fma)
library(lubridate)
library(TSstudio)
library(quantmod)
library(tidyquant)
library(plotly)
library(readr)
library(dplyr)
library(vars)
library(patchwork)
library(kableExtra)
library(gridExtra)

theme_set(theme_minimal(base_size = 12))

all_adv_files <- list.files("data/adv_stats", pattern = "\\.csv$", full.names = TRUE)

all_adv_data <- map_df(all_adv_files, function(file) {
    season_str <- str_extract(basename(file), "\\d{4}-\\d{2}")
    season_year <- as.numeric(str_sub(season_str, 1, 4)) + 1
    df <- read_csv(file, show_col_types = FALSE)
    df$Season <- season_year
    return(df)
})

league_avg <- all_adv_data %>%
    group_by(Season) %>%
    summarise(
        ORtg = mean(`Unnamed: 10_level_0_ORtg`, na.rm = TRUE),
        DRtg = mean(`Unnamed: 11_level_0_DRtg`, na.rm = TRUE),
        Pace = mean(`Unnamed: 13_level_0_Pace`, na.rm = TRUE),
        `3PAr` = mean(`Unnamed: 15_level_0_3PAr`, na.rm = TRUE),
        `TS%` = mean(`Unnamed: 16_level_0_TS%`, na.rm = TRUE),
        `eFG%` = mean(`Offense Four Factors_eFG%`, na.rm = TRUE),
        Total_Attendance = sum(`Unnamed: 29_level_0_Attend.`, na.rm = TRUE),
        .groups = "drop"
    )

# Create COVID dummy variable
league_avg <- league_avg %>%
    mutate(COVID_Dummy = ifelse(Season %in% c(2020, 2021), 1, 0))

ARIMAX/SARIMAX Models

Shot Selection & Efficiency (ARIMAX)

  • Response Variable: ORtg (Offensive Rating)
  • Exogenous Variables: 3PAr (3-Point Attempt Rate), TS% (True Shooting %)

This model addresses whether shooting more 3s and shooting accuracy explain offensive efficiency gains. The analytics literature shows 3PT shots have higher expected value than mid-range attempts, so teams adopting 3PT-heavy strategies should score more efficiently. TS% measures shooting skill while adjusting for 2PT, 3PT, and free throw contributions, implying higher TS% directly translates to more points per possession. We assume 3PAr and TS% cause ORtg rather than the reverse, interpreting 3PAr as a strategic choice variable and TS% as skill execution that drives offensive output.

Code
ts_ortg <- ts(league_avg$ORtg, start = 1980, frequency = 1)
ts_3par <- ts(league_avg$`3PAr`, start = 1980, frequency = 1)
ts_tspct <- ts(league_avg$`TS%`, start = 1980, frequency = 1)

p1 <- autoplot(ts_ortg) + labs(title = "Offensive Rating (ORtg)", y = "ORtg") + theme_minimal()
p2 <- autoplot(ts_3par) + labs(title = "3-Point Attempt Rate (3PAr)", y = "3PAr") + theme_minimal()
p3 <- autoplot(ts_tspct) + labs(title = "True Shooting % (TS%)", y = "TS%") + theme_minimal()

p1 / p2 / p3

The time series reveal basketball’s transformation at a glance: ORtg climbs from ~104 in 1980 to roughly 114-115 in the most recent seasons. Meanwhile, 3PAr explodes after 2012 (the analytics inflection point), accelerating from ~25% to over 40% of all shot attempts. True Shooting percentage rises steadily, suggesting shooting skill improved alongside strategic changes: players got better as teams got smarter. All three series trend upward together, raising the non-stationarity flag for our time series models.

Code
# Correlation analysis
cor_data <- league_avg %>% dplyr::select(ORtg, `3PAr`, `TS%`)
cat("Correlation Matrix:\n")
Correlation Matrix:
Code
print(round(cor(cor_data, use = "complete.obs"), 3))
      ORtg  3PAr   TS%
ORtg 1.000 0.554 0.958
3PAr 0.554 1.000 0.655
TS%  0.958 0.655 1.000

The correlations tell a clear story. ORtg vs 3PAr (r = 0.554) shows a strong positive relationship: shooting more threes correlates with better offense. ORtg vs TS% (r = 0.958) is a very strong positive relationship, suggesting shooting accuracy matters even more. The moderate 3PAr vs TS% correlation (r = 0.655) indicates that teams shooting more threes also tend to shoot better overall, likely a selection effect in which better-shooting teams attempt more threes. Both predictors correlate positively with offensive efficiency, but TS% shows the stronger association. The lesson: strategy matters, but execution matters more.

Code
xreg_matrix <- cbind(
    `3PAr` = ts_3par,
    `TS%` = ts_tspct
)

arimax_auto <- auto.arima(ts_ortg,
    xreg = xreg_matrix, seasonal = FALSE,
    stepwise = FALSE, approximation = FALSE
)

cat("Selected model:\n")
Selected model:
Code
print(arimax_auto)
Series: ts_ortg 
Regression with ARIMA(1,0,0) errors 

Coefficients:
         ar1     3PAr       TS%
      0.6668  -3.4929  200.0000
s.e.  0.1149   2.0492    0.9012

sigma^2 = 0.3911:  log likelihood = -41.47
AIC=90.94   AICc=91.94   BIC=98.17
Code
cat("\n\nModel Summary:\n")


Model Summary:
Code
summary(arimax_auto)
Series: ts_ortg 
Regression with ARIMA(1,0,0) errors 

Coefficients:
         ar1     3PAr       TS%
      0.6668  -3.4929  200.0000
s.e.  0.1149   2.0492    0.9012

sigma^2 = 0.3911:  log likelihood = -41.47
AIC=90.94   AICc=91.94   BIC=98.17

Training set error measures:
                     ME      RMSE       MAE        MPE      MAPE      MASE
Training set 0.03137575 0.6041491 0.4719296 0.02673679 0.4387345 0.4198899
                   ACF1
Training set -0.1741445

Model Diagnostics:

Code
checkresiduals(arimax_auto)


    Ljung-Box test

data:  Residuals from Regression with ARIMA(1,0,0) errors
Q* = 10.777, df = 8, p-value = 0.2147

Model df: 1.   Total lags used: 9
Code
# Ljung-Box test
ljung_auto <- Box.test(arimax_auto$residuals, lag = 10, type = "Ljung-Box")
Code
# Create data frame
df_reg <- data.frame(
    ORtg = as.numeric(ts_ortg),
    PAr3 = as.numeric(ts_3par),
    TSpct = as.numeric(ts_tspct)
)

# Fit regression
lm_model <- lm(ORtg ~ PAr3 + TSpct, data = df_reg)

cat("Linear Regression Results:\n")
Linear Regression Results:
Code
summary(lm_model)

Call:
lm(formula = ORtg ~ PAr3 + TSpct, data = df_reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5855 -0.4763 -0.0837  0.5999  2.0563 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    6.994      5.123   1.365   0.1794    
PAr3          -3.258      1.364  -2.390   0.0214 *  
TSpct        187.046      9.790  19.106   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8024 on 42 degrees of freedom
Multiple R-squared:  0.9284,    Adjusted R-squared:  0.925 
F-statistic: 272.5 on 2 and 42 DF,  p-value: < 2.2e-16

The regression equation reveals the mathematical relationship:

\[ \text{ORtg} = 6.99 - 3.26 \times \text{3PAr} + 187.05 \times \text{TS\%} \]

Both predictors are proportions on a 0-1 scale, so a 1 percentage point change corresponds to 0.01 units. Holding TS% constant, a 1 percentage point increase in 3PAr is associated with a 0.03-point decrease in ORtg; once shooting accuracy is controlled for, shot mix adds little on its own. A 1 percentage point increase in TS% is associated with a 1.87-point increase in ORtg (improving from 55% to 56% True Shooting adds about 1.87 points per 100 possessions), emphasising that shooting accuracy has a far stronger impact than shot selection.

Code
lm_residuals <- ts(residuals(lm_model), start = 1980, frequency = 1)

# Plot residuals
autoplot(lm_residuals) +
    labs(title = "Regression Residuals", y = "Residuals") +
    theme_minimal()

Code
# Fit ARIMA to residuals
arima_resid <- auto.arima(lm_residuals, seasonal = FALSE)

cat("\nARIMA model for residuals:\n")

ARIMA model for residuals:
Code
print(arima_resid)
Series: lm_residuals 
ARIMA(2,0,0) with zero mean 

Coefficients:
         ar1     ar2
      0.5151  0.2446
s.e.  0.1459  0.1502

sigma^2 = 0.3491:  log likelihood = -39.53
AIC=85.05   AICc=85.64   BIC=90.47
Code
# Diagnostics
checkresiduals(arima_resid)


    Ljung-Box test

data:  Residuals from ARIMA(2,0,0) with zero mean
Q* = 10.413, df = 7, p-value = 0.1664

Model df: 2.   Total lags used: 9
Code
arima_order <- arimaorder(arima_resid)

arimax_manual <- Arima(ts_ortg,
    order = c(arima_order[1], arima_order[2], arima_order[3]),
    xreg = xreg_matrix
)

print(arimax_manual)
Series: ts_ortg 
Regression with ARIMA(2,0,0) errors 

Coefficients:
         ar1     ar2  intercept     3PAr       TS%
      0.5088  0.2517    10.3174  -1.3748  180.2210
s.e.  0.1445  0.1485     6.9336   2.6743   13.2871

sigma^2 = 0.371:  log likelihood = -39.27
AIC=90.53   AICc=92.74   BIC=101.37
Code
print(coef(arimax_manual))
        ar1         ar2   intercept        3PAr         TS% 
  0.5087582   0.2516972  10.3173608  -1.3748398 180.2209981 
Code
train_end <- 2019
test_start <- 2020

# Split data
train_ortg <- window(ts_ortg, end = train_end)
train_3par <- window(ts_3par, end = train_end)
train_tspct <- window(ts_tspct, end = train_end)

test_ortg <- window(ts_ortg, start = test_start)
test_3par <- window(ts_3par, start = test_start)
test_tspct <- window(ts_tspct, start = test_start)

h <- length(test_ortg)

# Prepare xreg for train/test
xreg_train <- cbind(`3PAr` = train_3par, `TS%` = train_tspct)
xreg_test <- cbind(`3PAr` = test_3par, `TS%` = test_tspct)

# Model 1: auto.arima() method
fit_auto <- auto.arima(train_ortg, xreg = xreg_train, seasonal = FALSE)

# Model 2: Manual method
fit_manual <- Arima(train_ortg,
    order = c(arima_order[1], arima_order[2], arima_order[3]),
    xreg = xreg_train
)

# Model 3: Simple ARIMA without exogenous (benchmark)
fit_benchmark <- auto.arima(train_ortg, seasonal = FALSE)

# Generate forecasts
fc_auto <- forecast(fit_auto, xreg = xreg_test, h = h)
fc_manual <- forecast(fit_manual, xreg = xreg_test, h = h)
fc_benchmark <- forecast(fit_benchmark, h = h)

# Calculate accuracy
acc_auto <- accuracy(fc_auto, test_ortg)[2, c("RMSE", "MAE", "MAPE")]
acc_manual <- accuracy(fc_manual, test_ortg)[2, c("RMSE", "MAE", "MAPE")]
acc_benchmark <- accuracy(fc_benchmark, test_ortg)[2, c("RMSE", "MAE", "MAPE")]

# Display results
cat("\n=== CROSS-VALIDATION RESULTS (Test Set: 2020-2024) ===\n\n")

=== CROSS-VALIDATION RESULTS (Test Set: 2020-2024) ===
Code
cv_results <- data.frame(
    Model = c("ARIMAX", "ARIMAX", "ARIMA"),
    RMSE = c(acc_auto["RMSE"], acc_manual["RMSE"], acc_benchmark["RMSE"]),
    MAE = c(acc_auto["MAE"], acc_manual["MAE"], acc_benchmark["MAE"]),
    MAPE = c(acc_auto["MAPE"], acc_manual["MAPE"], acc_benchmark["MAPE"])
)

# Display formatted table
kable(cv_results,
    format = "html",
    digits = 3,
    caption = "Cross-Validation Results: ORtg Models",
    col.names = c("Model", "RMSE", "MAE", "MAPE (%)")
) %>%
    kable_styling(
        full_width = FALSE,
        bootstrap_options = c("striped", "hover", "condensed", "responsive")
    ) %>%
    row_spec(which.min(cv_results$RMSE), bold = TRUE, color = "white", background = "#006bb6")
Cross-Validation Results: ORtg Models
Model RMSE MAE MAPE (%)
ARIMAX 1.400 1.267 1.109
ARIMAX 1.400 1.267 1.109
ARIMA 5.665 5.319 4.655
Code
# Determine best model
best_idx <- which.min(cv_results$RMSE)
cat("\n\n*** BEST MODEL: ", cv_results$Model[best_idx], " ***\n")


*** BEST MODEL:  ARIMAX  ***
Code
cat("RMSE =", round(cv_results$RMSE[best_idx], 3), "\n")
RMSE = 1.4 
Code
# Select best model based on CV (row 1 = ARIMAX via auto.arima,
# row 2 = ARIMAX via the manual method, row 3 = ARIMA benchmark)
if (best_idx == 1) {
    final_arimax <- arimax_auto
} else if (best_idx == 2) {
    final_arimax <- arimax_manual
} else {
    final_arimax <- auto.arima(ts_ortg, seasonal = FALSE)
}

What Cross-Validation Reveals

The winner is ARIMAX with a test RMSE of 1.4, against 5.67 for the univariate ARIMA benchmark. This confirms that exogenous variables add real predictive power.

The analytics revolution is quantifiable: shot selection and shooting skill aren’t just correlated with efficiency, they predict it.

Code
# Refit winning model on full data
if ("xreg" %in% names(final_arimax$call)) {
    final_fit <- Arima(ts_ortg,
        order = arimaorder(final_arimax)[1:3],
        xreg = xreg_matrix
    )
} else {
    final_fit <- Arima(ts_ortg, order = arimaorder(final_arimax)[1:3])
}

print(summary(final_fit))
Series: ts_ortg 
ARIMA(0,1,0) 

sigma^2 = 2.025:  log likelihood = -77.95
AIC=157.9   AICc=157.99   BIC=159.68

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE      MASE
Training set 0.2030613 1.407018 1.101305 0.1760845 1.028496 0.9798637
                   ACF1
Training set -0.2573027
Code
# Also fit ARIMAX model for demonstration
cat("\n\n=== For Comparison: ARIMAX Model with Exogenous Variables ===\n")


=== For Comparison: ARIMAX Model with Exogenous Variables ===
Code
final_fit_arimax <- Arima(ts_ortg,
    order = arimaorder(arimax_auto)[1:3],
    xreg = xreg_matrix
)
print(summary(final_fit_arimax))
Series: ts_ortg 
Regression with ARIMA(1,0,0) errors 

Coefficients:
         ar1  intercept     3PAr       TS%
      0.6680     8.7816  -1.9290  183.2426
s.e.  0.1157     6.7908   2.3704   12.9989

sigma^2 = 0.3861:  log likelihood = -40.64
AIC=91.28   AICc=92.82   BIC=100.31

Training set error measures:
                    ME      RMSE      MAE       MPE      MAPE      MASE
Training set 0.0240792 0.5931024 0.472989 0.0181601 0.4400635 0.4208324
                   ACF1
Training set -0.1807397

Model Equations:

Full-Data Refit (ARIMA(0,1,0), a random walk):

\[ (1 - B) \text{ORtg}_t = \epsilon_t \]

ARIMAX Comparison Model (regression with AR(1) errors, using the coefficients printed above):

\[ \text{ORtg}_t = 8.78 - 1.93 \times \text{3PAr}_t + 183.24 \times \text{TS\%}_t + \eta_t, \qquad \eta_t = 0.668 \, \eta_{t-1} + \epsilon_t \]

Code
if ("xreg" %in% names(final_fit$call)) {
    # Forecast 3PAr and TS%
    fc_3par <- forecast(auto.arima(ts_3par), h = 5)
    fc_tspct <- forecast(auto.arima(ts_tspct), h = 5)

    # Create future xreg matrix
    xreg_future <- cbind(
        `3PAr` = fc_3par$mean,
        `TS%` = fc_tspct$mean
    )

    # Forecast ORtg
    fc_final <- forecast(final_fit, xreg = xreg_future, h = 5)
} else {
    fc_final <- forecast(final_fit, h = 5)
}

# Plot forecast
autoplot(fc_final) +
    labs(
        title = "ORtg Forecast: 2025-2029",
        subtitle = "Final model refit on full data; exogenous inputs (3PAr, TS%) are forecast separately when required",
        x = "Year",
        y = "Offensive Rating (Points per 100 Possessions)"
    ) +
    theme_minimal(base_size = 12) +
    theme(plot.subtitle = element_text(size = 9))

Code
cat("\nPoint Forecasts:\n")

Point Forecasts:
Code
print(fc_final$mean)
Time Series:
Start = 2025 
End = 2029 
Frequency = 1 
[1] 114.5323 114.5323 114.5323 114.5323 114.5323
Code
cat("\n\n80% Prediction Interval:\n")


80% Prediction Interval:
Code
print(fc_final$lower[, 1])
Time Series:
Start = 2025 
End = 2029 
Frequency = 1 
[1] 112.7087 111.9534 111.3738 110.8852 110.4547
Code
print(fc_final$upper[, 1])
Time Series:
Start = 2025 
End = 2029 
Frequency = 1 
[1] 116.3558 117.1111 117.6907 118.1793 118.6098

The ARIMAX model tells a compelling story about modern basketball’s transformation.

The Analytics Advantage

Cross-validation shows the exogenous variables adding real predictive power: the ARIMAX test RMSE of 1.4 is roughly a quarter of the univariate benchmark’s 5.7. One caveat is multicollinearity among the regressors: 3PAr and TS% correlate at 0.655, and TS% alone correlates with ORtg at 0.958, so individual coefficient estimates should be read cautiously.

Forecast Performance

The model achieved a test RMSE of 1.4 points per 100 possessions, corresponding to approximately 1.1% average error. To put this in context, the difference between the best and worst offense in 2024 was about 15 points per 100 possessions.

What the Future Holds

The full-data refit shown above is a random walk, so its point forecasts hold flat at the last observed value (~114.5 points per 100 possessions) while the prediction intervals widen each year. Forecasts from this model rely purely on historical ORtg patterns, capturing the league’s 45-year trajectory toward more efficient offense; the mechanism lives in the exogenous variables of the ARIMAX comparison model.

The Basketball Insight

Here’s what matters for teams: offensive efficiency isn’t magic, it’s a function of where you shoot (3PAr) and how well you shoot (TS%). The model equation makes this quantitative:

\[ \text{ORtg}_t = \beta_0 + \beta_1 \times \text{3PAr}_t + \beta_2 \times \text{TS\%}_t + N_t \]

where \(N_t\) captures autocorrelated shocks (momentum, injuries, schedule strength).

Teams have two levers:

  1. Strategic: Shoot more threes (reallocate shot distribution)
  2. Developmental: Improve shooting accuracy (player development, coaching)

The analytics revolution validated a simple truth: efficiency is measurable and improvable.
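As a hypothetical back-of-envelope comparison of the two levers, using the coefficients from the regression summary printed earlier (the +5 pp and +1 pp shifts are assumed scenarios, not recommendations):

```r
# Hardcoded coefficients from the linear regression above:
# PAr3 = -3.258, TSpct = 187.046 (both predictors on a 0-1 scale)
b_3par <- -3.258
b_tspct <- 187.046

# Strategic lever: raise 3PAr by 5 percentage points (0.05)
delta_strategic <- b_3par * 0.05

# Developmental lever: raise TS% by 1 percentage point (0.01)
delta_developmental <- b_tspct * 0.01

cat("ORtg change from +5 pp 3PAr:", round(delta_strategic, 2), "\n")
cat("ORtg change from +1 pp TS%: ", round(delta_developmental, 2), "\n")
```

The strategic shift nets about -0.16 points per 100 possessions once accuracy is held fixed, while the one-point accuracy gain adds about +1.87: by this model, the developmental lever dominates.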

COVID Impact on Attendance (ARIMAX with Intervention)

  • Response Variable: Total_Attendance
  • Exogenous Variables: ORtg, Pace, COVID_Dummy (pulse intervention)

The model includes three key variables: ORtg captures how better offensive performance leads to more entertaining games and higher attendance; Pace reflects whether faster games attract more fans; and the COVID_Dummy captures the structural break in 2020-2021 from empty arenas and capacity restrictions.
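In regression-with-ARIMA-errors form, the pulse intervention enters as a dummy regressor:

\[ \text{Attendance}_t = \beta_0 + \beta_1 \text{ORtg}_t + \beta_2 \text{Pace}_t + \omega I_t + N_t, \qquad I_t = \begin{cases} 1 & t \in \{2020, 2021\} \\ 0 & \text{otherwise} \end{cases} \]

where \(N_t\) is an ARMA error process and \(\omega\) measures the average per-season attendance shortfall during the COVID seasons.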

Code
league_post2000 <- league_avg %>% filter(Season >= 2000)

ts_attend <- ts(league_post2000$Total_Attendance, start = 2000, frequency = 1)
ts_ortg_sub <- ts(league_post2000$ORtg, start = 2000, frequency = 1)
ts_pace_sub <- ts(league_post2000$Pace, start = 2000, frequency = 1)
ts_covid <- ts(league_post2000$COVID_Dummy, start = 2000, frequency = 1)

p1 <- autoplot(ts_attend / 1e6) +
    labs(title = "Total NBA Attendance", y = "Attendance (Millions)") +
    geom_vline(xintercept = 2020, linetype = "dashed", color = "red") +
    annotate("text", x = 2020, y = 23, label = "COVID-19", color = "red", hjust = -0.1) +
    theme_minimal()

p2 <- autoplot(ts_ortg_sub) +
    labs(title = "Offensive Rating", y = "ORtg") +
    theme_minimal()

p3 <- autoplot(ts_pace_sub) +
    labs(title = "Pace", y = "Pace") +
    theme_minimal()

p1 / (p2 | p3)

The attendance plot tells a stark before-and-after story: from 2000-2019, the NBA showed remarkable stability around 22 million attendees per season as a reliable entertainment product. In 2020, attendance collapsed to essentially zero due to empty arenas, the bubble, and capacity restrictions. From 2021-2025, gradual recovery began but with visible scars. Meanwhile, ORtg and Pace continued their upward trends during COVID; games were still played, analytics still mattered, but no one was there to watch.

Code
# Prepare exogenous matrix
xreg_attend <- cbind(
    ORtg = ts_ortg_sub,
    Pace = ts_pace_sub,
    COVID = ts_covid
)

# auto.arima() method
arimax_attend_auto <- auto.arima(ts_attend, xreg = xreg_attend, seasonal = FALSE)

print(arimax_attend_auto)
Series: ts_attend 
Regression with ARIMA(0,0,0) errors 

Coefficients:
           ORtg      Pace      COVID
      -105929.9  354412.4  -13855332
s.e.   243282.0  278593.1    1905082

sigma^2 = 6.555e+12:  log likelihood = -418.94
AIC=845.89   AICc=847.79   BIC=850.92
Code
# Diagnostics
checkresiduals(arimax_attend_auto)


    Ljung-Box test

data:  Residuals from Regression with ARIMA(0,0,0) errors
Q* = 4.1469, df = 5, p-value = 0.5285

Model df: 0.   Total lags used: 5
Code
# Manual method: Regression + ARIMA
df_attend <- data.frame(
    Attendance = as.numeric(ts_attend),
    ORtg = as.numeric(ts_ortg_sub),
    Pace = as.numeric(ts_pace_sub),
    COVID = as.numeric(ts_covid)
)

lm_attend <- lm(Attendance ~ ORtg + Pace + COVID, data = df_attend)

cat("Regression Results:\n")
Regression Results:
Code
summary(lm_attend)

Call:
lm(formula = Attendance ~ ORtg + Pace + COVID, data = df_attend)

Residuals:
     Min       1Q   Median       3Q      Max 
-7856805  -474084   116291   715580  7856805 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1342578   16854399   0.080    0.937    
ORtg          -113325     280214  -0.404    0.690    
Pace           348598     311438   1.119    0.275    
COVID       -13793998    2208635  -6.245 2.76e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2617000 on 22 degrees of freedom
Multiple R-squared:  0.658, Adjusted R-squared:  0.6114 
F-statistic: 14.11 on 3 and 22 DF,  p-value: 2.398e-05

The regression reveals COVID’s devastating impact in stark numerical terms:

\[ \beta_{\text{COVID}} = -13,793,998 \]

The pandemic reduced attendance by approximately 13.8 million attendees in each affected season (2020-2021). For context, total pre-pandemic attendance ran around 22 million per season, so this coefficient represents the loss of most of a typical season’s gate.

Code
# Note: this chunk is reconstructed to match the printed output below.
resid_attend <- ts(residuals(lm_attend), start = 2000, frequency = 1)

# Fit an ARIMA model to the regression residuals
arima_resid_attend <- auto.arima(resid_attend, seasonal = FALSE)
cat("\nARIMA model for residuals:\n")
print(arima_resid_attend)

# Refit the regression with the selected ARIMA error structure
arimax_attend_manual <- Arima(ts_attend,
    order = arimaorder(arima_resid_attend)[1:3],
    xreg = xreg_attend
)
print(arimax_attend_manual)

ARIMA model for residuals:
Series: resid_attend 
ARIMA(0,0,1) with zero mean 

Coefficients:
          ma1
      -0.5235
s.e.   0.2180

sigma^2 = 4.872e+12:  log likelihood = -416.33
AIC=836.67   AICc=837.19   BIC=839.18
Series: ts_attend 
Regression with ARIMA(0,0,1) errors 

Coefficients:
          ma1  intercept       ORtg       Pace      COVID
      -1.0000    5870566  -71441.76  253945.85  -14057500
s.e.   0.1124    6486612  104863.14   98731.21    1384693

sigma^2 = 4.513e+12:  log likelihood = -414.56
AIC=841.12   AICc=845.54   BIC=848.67
Code
# Train: 2000-2018, Test: 2019-2024 (includes pre-COVID and COVID periods)
train_end_att <- 2018
test_start_att <- 2019

train_attend <- window(ts_attend, end = train_end_att)
train_ortg_a <- window(ts_ortg_sub, end = train_end_att)
train_pace_a <- window(ts_pace_sub, end = train_end_att)
train_covid_a <- window(ts_covid, end = train_end_att)

test_attend <- window(ts_attend, start = test_start_att)
test_ortg_a <- window(ts_ortg_sub, start = test_start_att)
test_pace_a <- window(ts_pace_sub, start = test_start_att)
test_covid_a <- window(ts_covid, start = test_start_att)

h_att <- length(test_attend)

xreg_train_att <- cbind(ORtg = train_ortg_a, Pace = train_pace_a, COVID = train_covid_a)
xreg_test_att <- cbind(ORtg = test_ortg_a, Pace = test_pace_a, COVID = test_covid_a)


# Model 1: auto.arima() - use simpler constraints for small dataset
fit_auto_att <- tryCatch(
    {
        auto.arima(train_attend,
            xreg = xreg_train_att, seasonal = FALSE,
            max.p = 2, max.q = 2, max.d = 1, stepwise = TRUE
        )
    },
    error = function(e) {
        xreg_no_covid <- cbind(ORtg = train_ortg_a, Pace = train_pace_a)
        auto.arima(train_attend,
            xreg = xreg_no_covid, seasonal = FALSE,
            max.p = 2, max.q = 2, max.d = 1
        )
    }
)

# Model 2: Manual method - use simpler order if needed
fit_manual_att <- tryCatch(
    {
        Arima(train_attend,
            order = arimaorder(arima_resid_attend)[1:3],
            xreg = xreg_train_att
        )
    },
    error = function(e) {
        xreg_no_covid <- cbind(ORtg = train_ortg_a, Pace = train_pace_a)
        Arima(train_attend, order = c(0, 1, 0), xreg = xreg_no_covid)
    }
)

# Model 3: Benchmark
fit_bench_att <- auto.arima(train_attend, seasonal = FALSE, max.p = 2, max.q = 2)

# Forecasts - handle different xreg structures
if ("COVID" %in% names(coef(fit_auto_att))) {
    fc_auto_att <- forecast(fit_auto_att, xreg = xreg_test_att, h = h_att)
} else {
    xreg_test_no_covid <- cbind(ORtg = test_ortg_a, Pace = test_pace_a)
    fc_auto_att <- forecast(fit_auto_att, xreg = xreg_test_no_covid, h = h_att)
}

if ("COVID" %in% names(coef(fit_manual_att))) {
    fc_manual_att <- forecast(fit_manual_att, xreg = xreg_test_att, h = h_att)
} else {
    xreg_test_no_covid <- cbind(ORtg = test_ortg_a, Pace = test_pace_a)
    fc_manual_att <- forecast(fit_manual_att, xreg = xreg_test_no_covid, h = h_att)
}

fc_bench_att <- forecast(fit_bench_att, h = h_att)

# Accuracy
acc_auto_att <- accuracy(fc_auto_att, test_attend)[2, c("RMSE", "MAE", "MAPE")]
acc_manual_att <- accuracy(fc_manual_att, test_attend)[2, c("RMSE", "MAE", "MAPE")]
acc_bench_att <- accuracy(fc_bench_att, test_attend)[2, c("RMSE", "MAE", "MAPE")]

cat("\n=== ATTENDANCE MODEL CROSS-VALIDATION (Test: 2019-2024) ===\n\n")

=== ATTENDANCE MODEL CROSS-VALIDATION (Test: 2019-2024) ===
Code
cv_att_results <- data.frame(
    Model = c("ARIMAX", "ARIMAX", "ARIMA"),
    RMSE = c(acc_auto_att["RMSE"], acc_manual_att["RMSE"], acc_bench_att["RMSE"]),
    MAE = c(acc_auto_att["MAE"], acc_manual_att["MAE"], acc_bench_att["MAE"]),
    MAPE = c(acc_auto_att["MAPE"], acc_manual_att["MAPE"], acc_bench_att["MAPE"])
)

# Display formatted table with RMSE in millions
cv_att_display <- cv_att_results
cv_att_display$RMSE <- cv_att_display$RMSE / 1e6
cv_att_display$MAE <- cv_att_display$MAE / 1e6

kable(cv_att_display,
    format = "html",
    digits = 3,
    caption = "Cross-Validation Results: Attendance Models (Test Set: 2019-2024)",
    col.names = c("Model", "RMSE (Millions)", "MAE (Millions)", "MAPE (%)")
) %>%
    kable_styling(
        full_width = FALSE,
        bootstrap_options = c("striped", "hover", "condensed", "responsive")
    ) %>%
    row_spec(which.min(cv_att_results$RMSE), bold = TRUE, color = "white", background = "#006bb6")
Cross-Validation Results: Attendance Models (Test Set: 2019-2024)
Model RMSE (Millions) MAE (Millions) MAPE (%)
ARIMAX 9.268 5.840 227.594
ARIMAX 9.607 6.436 234.399
ARIMA 7.815 4.195 194.025
Code
best_idx_att <- which.min(cv_att_results$RMSE)
cat("\n*** BEST MODEL: ", cv_att_results$Model[best_idx_att], "***\n")

*** BEST MODEL:  ARIMA ***

Cross-validation favors the plain ARIMA benchmark here: the training window (2000-2018) contains no pandemic seasons, so the COVID dummy and the other exogenous terms could not improve out-of-sample forecasts. Fitted on the full data (2000-2025), by contrast, the COVID dummy captures the unprecedented shock effectively.

Code
# Refit best model on full data (row 1 = ARIMAX via auto.arima,
# row 2 = ARIMAX via the manual method, row 3 = ARIMA benchmark)
if (best_idx_att == 1) {
    final_attend <- Arima(ts_attend,
        order = arimaorder(arimax_attend_auto)[1:3],
        xreg = xreg_attend
    )
} else if (best_idx_att == 2) {
    final_attend <- arimax_attend_manual
} else {
    final_attend <- auto.arima(ts_attend, seasonal = FALSE)
}

cat("Final Attendance Model:\n")
Final Attendance Model:
Code
print(summary(final_attend))
Series: ts_attend 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
            mean
      20953026.5
s.e.    807373.2

sigma^2 = 1.763e+13:  log likelihood = -432.89
AIC=869.78   AICc=870.3   BIC=872.29

Training set error measures:
                        ME    RMSE     MAE       MPE     MAPE      MASE
Training set -6.196877e-09 4116988 2045563 -45.69258 54.75919 0.9069228
                 ACF1
Training set 0.163378
Code
# Forecast
fc_ortg_fut <- forecast(auto.arima(ts_ortg_sub), h = 5)
fc_pace_fut <- forecast(auto.arima(ts_pace_sub), h = 5)

# Build future xreg to match the fitted model; only pass xreg to
# forecast() when the model was actually fitted with regressors
if ("COVID" %in% names(coef(final_attend))) {
    xreg_future_att <- cbind(
        ORtg = fc_ortg_fut$mean,
        Pace = fc_pace_fut$mean,
        COVID = rep(0, 5)
    )
    fc_attend_final <- forecast(final_attend, xreg = xreg_future_att, h = 5)
} else if ("ORtg" %in% names(coef(final_attend))) {
    xreg_future_att <- cbind(
        ORtg = fc_ortg_fut$mean,
        Pace = fc_pace_fut$mean
    )
    fc_attend_final <- forecast(final_attend, xreg = xreg_future_att, h = 5)
} else {
    # Univariate model (the CV winner): no regressors to supply
    fc_attend_final <- forecast(final_attend, h = 5)
}

autoplot(fc_attend_final) +
    labs(
        title = "NBA Attendance Forecast: 2026-2030",
        subtitle = "Assumes full COVID recovery (COVID_Dummy = 0)",
        x = "Year",
        y = "Total Attendance (Millions)"
    ) +
    scale_y_continuous(labels = function(x) x / 1e6) +
    theme_minimal()

Here’s where cross-validation reveals a hard truth about forecasting: the COVID dummy was all zeros in our training data (2000-2018) because the pandemic didn’t exist yet. The model couldn’t learn what it hadn’t seen.

When the test period arrived (2019-2024), actual attendance plummeted by roughly 90% in 2020. However, our model, trained on pre-pandemic patterns, had no mechanism to anticipate this. This demonstrates a fundamental challenge of time series forecasting: external shocks that have never occurred before are, by definition, unforecastable.

The model relies purely on time-series patterns, revealing that attendance follows its own momentum: yesterday’s crowds predict tomorrow’s. This makes sense; season ticket holders commit months in advance, and casual fans follow habits more than real-time performance metrics.


VAR (Vector Autoregression) Models

Efficiency Drivers (VAR)

Variables: ORtg, Pace, 3PAr

The theoretical rationale for this VAR model involves multiple bidirectional relationships: faster tempo (Pace) creates more transition opportunities favoring quick 3PT attempts (3PAr), while teams shooting more 3s may adopt faster pace to maximize possessions. Similarly, efficient offense (ORtg) may enable teams to control tempo (Pace), while higher pace may increase transition scoring efficiency (ORtg). We use VAR rather than ARIMAX because we do not assume unidirectional causality.
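With one lag (the order ultimately selected below) on the first-differenced series, the system is:

\[ \begin{pmatrix} \Delta \text{ORtg}_t \\ \Delta \text{Pace}_t \\ \Delta \text{3PAr}_t \end{pmatrix} = \mathbf{c} + A_1 \begin{pmatrix} \Delta \text{ORtg}_{t-1} \\ \Delta \text{Pace}_{t-1} \\ \Delta \text{3PAr}_{t-1} \end{pmatrix} + \boldsymbol{\epsilon}_t \]

Each row lets lagged values of all three variables predict each series, so no variable is privileged as exogenous.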

Code
# Create VAR dataset
var_data <- ts(league_avg %>% dplyr::select(ORtg, Pace, `3PAr`),
    start = 1980, frequency = 1
)

# Plot all series
autoplot(var_data, facets = TRUE) +
    labs(
        title = "VAR Variables: ORtg, Pace, 3PAr (1980-2025)",
        x = "Year", y = "Value"
    ) +
    theme_minimal()

Code
cat("Summary Statistics:\n")
Summary Statistics:
Code
summary(var_data)
      ORtg            Pace             3PAr        
 Min.   :102.2   Min.   : 88.92   Min.   :0.02292  
 1st Qu.:105.8   1st Qu.: 91.81   1st Qu.:0.08761  
 Median :107.5   Median : 95.77   Median :0.19584  
 Mean   :107.5   Mean   : 95.65   Mean   :0.19578  
 3rd Qu.:108.2   3rd Qu.: 99.18   3rd Qu.:0.25958  
 Max.   :115.3   Max.   :103.06   Max.   :0.42119  
Code
# ADF tests for each series
cat("=== STATIONARITY TESTS ===\n\n")
=== STATIONARITY TESTS ===
Code
adf_ortg_var <- adf.test(var_data[, "ORtg"])
adf_pace_var <- adf.test(var_data[, "Pace"])
adf_3par_var <- adf.test(var_data[, "3PAr"])

cat(
    "ORtg: ADF p-value =", round(adf_ortg_var$p.value, 4),
    ifelse(adf_ortg_var$p.value < 0.05, "(stationary)", "implies non-stationary"), "\n"
)
ORtg: ADF p-value = 0.9233 implies non-stationary 
Code
cat(
    "Pace: ADF p-value =", round(adf_pace_var$p.value, 4),
    ifelse(adf_pace_var$p.value < 0.05, "(stationary)", "(non-stationary)"), "\n"
)
Pace: ADF p-value = 0.8116 (non-stationary) 
Code
cat(
    "3PAr: ADF p-value =", round(adf_3par_var$p.value, 4),
    ifelse(adf_3par_var$p.value < 0.05, "(stationary)", "(non-stationary)"), "\n\n"
)
3PAr: ADF p-value = 0.8303 (non-stationary) 
Code
# Difference if non-stationary
if (adf_ortg_var$p.value > 0.05 | adf_pace_var$p.value > 0.05 | adf_3par_var$p.value > 0.05) {
    var_data_diff <- diff(var_data)

    # Test differenced data
    adf_ortg_diff <- adf.test(var_data_diff[, "ORtg"])
    adf_pace_diff <- adf.test(var_data_diff[, "Pace"])
    adf_3par_diff <- adf.test(var_data_diff[, "3PAr"])

    cat("After first-differencing:\n")
    cat("ORtg: ADF p-value =", round(adf_ortg_diff$p.value, 4), "\n")
    cat("Pace: ADF p-value =", round(adf_pace_diff$p.value, 4), "\n")
    cat("3PAr: ADF p-value =", round(adf_3par_diff$p.value, 4), "\n\n")

    if (adf_ortg_diff$p.value < 0.05 & adf_pace_diff$p.value < 0.05 & adf_3par_diff$p.value < 0.05) {
        cat("All series now stationary. Proceeding with differenced data for VAR.\n\n")
    } else {
        cat("Some series remain borderline by the ADF test after differencing;\n")
        cat("proceeding with the differenced data, but interpreting VAR results cautiously.\n\n")
    }
    var_data_final <- var_data_diff
    differenced <- TRUE
} else {
    var_data_final <- var_data
    differenced <- FALSE
}
After first-differencing:
ORtg: ADF p-value = 0.109 
Pace: ADF p-value = 0.187 
3PAr: ADF p-value = 0.0446 

Some series remain borderline by the ADF test after differencing;
proceeding with the differenced data, but interpreting VAR results cautiously.
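Because the ADF test has low power in samples this short, a complementary check with the KPSS test, whose null hypothesis is stationarity (the reverse of ADF), can corroborate the differencing decision. A minimal sketch, assuming the `tseries` package (which also supplies `adf.test`) and the `var_data_diff` object from above:

```r
library(tseries)

# KPSS null is level-stationarity, so LARGE p-values here are consistent
# with stationarity -- the opposite convention of adf.test() above.
kpss.test(var_data_diff[, "ORtg"])
kpss.test(var_data_diff[, "Pace"])
kpss.test(var_data_diff[, "3PAr"])
```

If ADF and KPSS disagree for a series, treat it as borderline and check whether the downstream VAR conclusions are robust to that ambiguity.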
Code
# Determine optimal lag order
cat("=== LAG ORDER SELECTION ===\n\n")
=== LAG ORDER SELECTION ===
Code
var_select <- VARselect(var_data_final, lag.max = 8, type = "const")

print(var_select$selection)
AIC(n)  HQ(n)  SC(n) FPE(n) 
     1      1      1      1 
Code
print(var_select$criteria)
                  1            2            3            4            5
AIC(n) -6.680901553 -6.500274539 -6.382532287 -6.281073692 -5.992741694
HQ(n)  -6.496671379 -6.177871734 -5.921956852 -5.682325625 -5.255820997
SC(n)  -6.153061907 -5.576555158 -5.062933172 -4.565594842 -3.881383109
FPE(n)  0.001258119  0.001525812  0.001768605  0.002072992  0.003049206
                  6            7            8
AIC(n) -6.029048961 -6.000186138 -5.898738412
HQ(n)  -5.153955634 -4.986920179 -4.747299824
SC(n)  -3.521810642 -3.097068084 -2.599740624
FPE(n)  0.003436313  0.004504422  0.007252063
Code
# Fit models with different lag orders
lags_to_fit <- unique(var_select$selection[1:3])

# Ensure we have at least one lag to fit
if (length(lags_to_fit) == 0 || any(is.na(lags_to_fit))) {
    cat("Warning: VARselect returned no valid lags. Defaulting to p=1\n")
    lags_to_fit <- 1
}

cat("Fitting VAR models with p =", paste(lags_to_fit, collapse = ", "), "\n\n")
Fitting VAR models with p = 1 
Code
var_models <- list()
for (p in lags_to_fit) {
    var_models[[paste0("VAR_", p)]] <- VAR(var_data_final, p = p, type = "const")
    cat("VAR(", p, ") fitted successfully\n", sep = "")
}
VAR(1) fitted successfully
Code
for (name in names(var_models)) {
    cat("========================================\n")
    cat(name, "Summary:\n")
    cat("========================================\n\n")
    print(summary(var_models[[name]]))
    cat("\n\n")
}
========================================
VAR_1 Summary:
========================================


VAR Estimation Results:
========================= 
Endogenous variables: ORtg, Pace, X3PAr 
Deterministic variables: const 
Sample size: 43 
Log Likelihood: -24.745 
Roots of the characteristic polynomial:
0.3236 0.1471 0.1355
Call:
VAR(y = var_data_final, p = p, type = "const")


Estimation results for equation ORtg: 
===================================== 
ORtg = ORtg.l1 + Pace.l1 + X3PAr.l1 + const 

         Estimate Std. Error t value Pr(>|t|)  
ORtg.l1  -0.38350    0.15583  -2.461   0.0184 *
Pace.l1   0.17639    0.15459   1.141   0.2608  
X3PAr.l1 29.14282   14.02867   2.077   0.0444 *
const     0.02675    0.23554   0.114   0.9102  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 1.349 on 39 degrees of freedom
Multiple R-Squared: 0.1733, Adjusted R-squared: 0.1097 
F-statistic: 2.724 on 3 and 39 DF,  p-value: 0.05723 


Estimation results for equation Pace: 
===================================== 
Pace = ORtg.l1 + Pace.l1 + X3PAr.l1 + const 

         Estimate Std. Error t value Pr(>|t|)
ORtg.l1  -0.03026    0.16414  -0.184    0.855
Pace.l1  -0.11711    0.16283  -0.719    0.476
X3PAr.l1  6.57227   14.77673   0.445    0.659
const    -0.10617    0.24810  -0.428    0.671


Residual standard error: 1.421 on 39 degrees of freedom
Multiple R-Squared: 0.02201,    Adjusted R-squared: -0.05322 
F-statistic: 0.2925 on 3 and 39 DF,  p-value: 0.8305 


Estimation results for equation X3PAr: 
====================================== 
X3PAr = ORtg.l1 + Pace.l1 + X3PAr.l1 + const 

           Estimate Std. Error t value Pr(>|t|)   
ORtg.l1  -0.0008257  0.0018756  -0.440   0.6622   
Pace.l1   0.0011681  0.0018606   0.628   0.5338   
X3PAr.l1  0.1654261  0.1688517   0.980   0.3333   
const     0.0080410  0.0028350   2.836   0.0072 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.01623 on 39 degrees of freedom
Multiple R-Squared: 0.02972,    Adjusted R-squared: -0.04492 
F-statistic: 0.3982 on 3 and 39 DF,  p-value: 0.7551 



Covariance matrix of residuals:
          ORtg      Pace      X3PAr
ORtg  1.818853  0.407221  0.0052772
Pace  0.407221  2.018000 -0.0020829
X3PAr 0.005277 -0.002083  0.0002635

Correlation matrix of residuals:
        ORtg     Pace    X3PAr
ORtg  1.0000  0.21255  0.24106
Pace  0.2126  1.00000 -0.09033
X3PAr 0.2411 -0.09033  1.00000
Code
cat("=== TIME SERIES CROSS-VALIDATION FOR VAR ===\n\n")
=== TIME SERIES CROSS-VALIDATION FOR VAR ===
Code
# Split: Train 1980-2019, Test 2020-2024
train_end_var <- 2019

# The differenced series starts in 1981 (1980 is lost to differencing), but the
# split is identical either way: train through 2019, test from 2020 onward.
train_var <- window(var_data_final, end = train_end_var)
test_var <- window(var_data_final, start = train_end_var + 1)

h_var <- nrow(test_var)

# Fit VAR models on training data with error handling
var_train_models <- list()
for (p in lags_to_fit) {
    model <- tryCatch(
        {
            VAR(train_var, p = p, type = "const")
        },
        error = function(e) {
            cat("Warning: VAR(", p, ") failed. Trying with smaller lag...\n", sep = "")
            if (p > 1) {
                VAR(train_var, p = 1, type = "const")
            } else {
                NULL
            }
        }
    )
    if (!is.null(model)) {
        var_train_models[[paste0("VAR_", p)]] <- model
    }
}

# Generate forecasts
rmse_results <- data.frame()

cat("Number of VAR models to evaluate:", length(var_train_models), "\n")
Number of VAR models to evaluate: 1 
Code
if (length(var_train_models) == 0) {
    cat("ERROR: No VAR models were successfully fitted. Check data and lag selection.\n")
    cat("  Lags attempted:", paste(lags_to_fit, collapse = ", "), "\n")
    cat("  Train data dimensions:", nrow(train_var), "rows x", ncol(train_var), "cols\n")
} else {
    cat("Successfully fitted", length(var_train_models), "VAR model(s):", paste(names(var_train_models), collapse = ", "), "\n\n")
}
Successfully fitted 1 VAR model(s): VAR_1 
Code
for (name in names(var_train_models)) {
    fc <- tryCatch(
        {
            predict(var_train_models[[name]], n.ahead = h_var)
        },
        error = function(e) {
            cat("Warning: Forecast failed for", name, ":", e$message, "\n")
            NULL
        }
    )

    if (is.null(fc)) next

    # Extract forecasts for each variable - handle different column name formats
    fc_var_names <- names(fc$fcst)

    # Find ORtg
    fc_ortg <- fc$fcst$ORtg[, "fcst"]

    # Find Pace
    fc_pace <- fc$fcst$Pace[, "fcst"]

    # Find 3PAr (might be stored with backticks or without)
    tpar_fc_name <- fc_var_names[grep("3PAr|PAr", fc_var_names, ignore.case = TRUE)]
    if (length(tpar_fc_name) == 0) {
        tpar_fc_name <- fc_var_names[3] # Default to third variable
    } else {
        tpar_fc_name <- tpar_fc_name[1] # Take first match
    }
    fc_3par <- fc$fcst[[tpar_fc_name]][, "fcst"]

    # Convert test data to numeric vectors for comparison
    # Handle column names similarly
    test_var_names <- colnames(test_var)
    test_ortg_vec <- as.numeric(test_var[, "ORtg"])
    test_pace_vec <- as.numeric(test_var[, "Pace"])

    tpar_test_name <- test_var_names[grep("3PAr|PAr", test_var_names, ignore.case = TRUE)]
    if (length(tpar_test_name) == 0) {
        tpar_test_name <- test_var_names[3]
    } else {
        tpar_test_name <- tpar_test_name[1]
    }
    test_3par_vec <- as.numeric(test_var[, tpar_test_name])

    # Ensure equal lengths (forecasts might be shorter if h_var is large)
    n_compare <- min(length(test_ortg_vec), length(fc_ortg), length(fc_pace), length(fc_3par))

    # Calculate RMSE for each variable
    rmse_ortg <- sqrt(mean((test_ortg_vec[1:n_compare] - fc_ortg[1:n_compare])^2, na.rm = TRUE))
    rmse_pace <- sqrt(mean((test_pace_vec[1:n_compare] - fc_pace[1:n_compare])^2, na.rm = TRUE))
    rmse_3par <- sqrt(mean((test_3par_vec[1:n_compare] - fc_3par[1:n_compare])^2, na.rm = TRUE))

    # Average RMSE across variables
    rmse_avg <- mean(c(rmse_ortg, rmse_pace, rmse_3par), na.rm = TRUE)

    cat("  ", name, ": ORtg RMSE =", round(rmse_ortg, 4),
        "| Pace RMSE =", round(rmse_pace, 4),
        "| 3PAr RMSE =", round(rmse_3par, 4),
        "| Avg =", round(rmse_avg, 4), "\n")

    # Create descriptive model name
    lag_num <- as.numeric(gsub("VAR_", "", name))
    model_label <- paste0("VAR(", lag_num, ")")

    rmse_results <- rbind(rmse_results, data.frame(
        Model = model_label,
        Lags = lag_num,
        RMSE_ORtg = rmse_ortg,
        RMSE_Pace = rmse_pace,
        RMSE_3PAr = rmse_3par,
        RMSE_Avg = rmse_avg
    ))
}
   VAR_1 : ORtg RMSE = 1.3634 | Pace RMSE = 0.8628 | 3PAr RMSE = 0.0126 | Avg = 0.7463 
Code
cat("\n=== CROSS-VALIDATION RESULTS ===\n\n")

=== CROSS-VALIDATION RESULTS ===
Cross-Validation Results: VAR Models (Test Set: 2020-2024)

Model    Lags  RMSE (ORtg)  RMSE (Pace)  RMSE (3PAr)  Avg RMSE
VAR(1)      1       1.3634       0.8628       0.0126     0.7463
Code
if (exists("rmse_results") && nrow(rmse_results) > 0) {
    # Plot RMSEs
    ggplot(rmse_results, aes(x = factor(Lags), y = RMSE_Avg, fill = Model)) +
        geom_bar(stat = "identity", width = 0.6) +
        geom_text(aes(label = round(RMSE_Avg, 3)), vjust = -0.5, fontface = "bold") +
        labs(
            title = "VAR Model Cross-Validation: Average RMSE",
            subtitle = "Lower RMSE = Better out-of-sample forecast performance",
            x = "Number of Lags (p)",
            y = "Average RMSE across ORtg, Pace, 3PAr"
        ) +
        theme_minimal() +
        theme(legend.position = "none") +
        scale_fill_brewer(palette = "Set2")
}

Code
if (exists("rmse_results") && nrow(rmse_results) > 0) {
    best_var_idx <- which.min(rmse_results$RMSE_Avg)
    cat("\n*** BEST VAR MODEL: ", rmse_results$Model[best_var_idx], " ***\n")
    cat("Average RMSE =", round(rmse_results$RMSE_Avg[best_var_idx], 4), "\n")
} else {
    cat("No cross-validation results available (all models failed)\n")
    best_var_idx <- 1 # Default to first model
}

*** BEST VAR MODEL:  VAR(1)  ***
Average RMSE = 0.7463 
Code
# Select best model
if (exists("rmse_results") && nrow(rmse_results) > 0 && exists("best_var_idx")) {
    best_var_name <- rmse_results$Model[best_var_idx]
    best_var_lags <- rmse_results$Lags[best_var_idx]
} else {
    best_var_lags <- min(lags_to_fit)
}

cat("Final VAR Model: VAR(", best_var_lags, ")\n\n", sep = "")
Final VAR Model: VAR(1)
Code
# Refit on full data
final_var <- tryCatch(
    {
        VAR(var_data_final, p = best_var_lags, type = "const")
    },
    error = function(e) {
        VAR(var_data_final, p = 1, type = "const")
    }
)

print(summary(final_var))

VAR Estimation Results:
========================= 
Endogenous variables: ORtg, Pace, X3PAr 
Deterministic variables: const 
Sample size: 43 
Log Likelihood: -24.745 
Roots of the characteristic polynomial:
0.3236 0.1471 0.1355
Call:
VAR(y = var_data_final, p = best_var_lags, type = "const")


Estimation results for equation ORtg: 
===================================== 
ORtg = ORtg.l1 + Pace.l1 + X3PAr.l1 + const 

         Estimate Std. Error t value Pr(>|t|)  
ORtg.l1  -0.38350    0.15583  -2.461   0.0184 *
Pace.l1   0.17639    0.15459   1.141   0.2608  
X3PAr.l1 29.14282   14.02867   2.077   0.0444 *
const     0.02675    0.23554   0.114   0.9102  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 1.349 on 39 degrees of freedom
Multiple R-Squared: 0.1733, Adjusted R-squared: 0.1097 
F-statistic: 2.724 on 3 and 39 DF,  p-value: 0.05723 


Estimation results for equation Pace: 
===================================== 
Pace = ORtg.l1 + Pace.l1 + X3PAr.l1 + const 

         Estimate Std. Error t value Pr(>|t|)
ORtg.l1  -0.03026    0.16414  -0.184    0.855
Pace.l1  -0.11711    0.16283  -0.719    0.476
X3PAr.l1  6.57227   14.77673   0.445    0.659
const    -0.10617    0.24810  -0.428    0.671


Residual standard error: 1.421 on 39 degrees of freedom
Multiple R-Squared: 0.02201,    Adjusted R-squared: -0.05322 
F-statistic: 0.2925 on 3 and 39 DF,  p-value: 0.8305 


Estimation results for equation X3PAr: 
====================================== 
X3PAr = ORtg.l1 + Pace.l1 + X3PAr.l1 + const 

           Estimate Std. Error t value Pr(>|t|)   
ORtg.l1  -0.0008257  0.0018756  -0.440   0.6622   
Pace.l1   0.0011681  0.0018606   0.628   0.5338   
X3PAr.l1  0.1654261  0.1688517   0.980   0.3333   
const     0.0080410  0.0028350   2.836   0.0072 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.01623 on 39 degrees of freedom
Multiple R-Squared: 0.02972,    Adjusted R-squared: -0.04492 
F-statistic: 0.3982 on 3 and 39 DF,  p-value: 0.7551 



Covariance matrix of residuals:
          ORtg      Pace      X3PAr
ORtg  1.818853  0.407221  0.0052772
Pace  0.407221  2.018000 -0.0020829
X3PAr 0.005277 -0.002083  0.0002635

Correlation matrix of residuals:
        ORtg     Pace    X3PAr
ORtg  1.0000  0.21255  0.24106
Pace  0.2126  1.00000 -0.09033
X3PAr 0.2411 -0.09033  1.00000
Code
# Forecast 5 periods ahead
fc_var_final <- predict(final_var, n.ahead = 5)

# Get variable names
fc_var_names <- names(fc_var_final$fcst)
tpar_fc_name <- fc_var_names[grep("3PAr|PAr", fc_var_names, ignore.case = TRUE)]
if (length(tpar_fc_name) == 0) tpar_fc_name <- fc_var_names[3]

# Display forecasts
cat("\n=== 5-Period Forecasts ===\n\n")

=== 5-Period Forecasts ===
Code
cat("ORtg:\n")
ORtg:
Code
print(fc_var_final$fcst$ORtg)
            fcst     lower    upper       CI
[1,]  1.14541147 -1.497891 3.788714 2.643302
[2,] -0.01197986 -2.904812 2.880852 2.892832
[3,]  0.29428871 -2.615367 3.203944 2.909656
[4,]  0.18514537 -2.726620 3.096911 2.911765
[5,]  0.21921670 -2.692756 3.131190 2.911973
Code
cat("\n", tpar_fc_name, ":\n", sep = "")

X3PAr:
Code
print(fc_var_final$fcst[[tpar_fc_name]])
            fcst       lower      upper         CI
[1,] 0.013430804 -0.01838448 0.04524608 0.03181528
[2,] 0.009377456 -0.02292743 0.04168234 0.03230489
[3,] 0.009533668 -0.02277694 0.04184428 0.03231061
[4,] 0.009331515 -0.02297983 0.04164286 0.03231135
[5,] 0.009375650 -0.02293573 0.04168703 0.03231138
Code
cat("\nPace:\n")

Pace:
Code
print(fc_var_final$fcst$Pace)
            fcst     lower    upper       CI
[1,]  0.05172503 -2.732528 2.835978 2.784253
[2,] -0.05861195 -2.873542 2.756318 2.814930
[3,] -0.03730962 -2.852842 2.778223 2.815533
[4,] -0.04804480 -2.863606 2.767516 2.815561
[5,] -0.04481374 -2.860377 2.770749 2.815563
Code
# Plot forecasts
for (vname in fc_var_names) {
    plot(fc_var_final, names = vname)
}
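Because the VAR was fitted on first differences, the forecasts above are year-over-year changes, not levels. To express them on the original scale, cumulate each forecast path from the last observed level. A minimal sketch for ORtg (assuming `var_data` still holds the undifferenced series):

```r
# Undo first-differencing: level forecast = last observed level
# plus the running sum of forecasted changes.
last_ortg <- as.numeric(tail(var_data[, "ORtg"], 1))
ortg_level_fc <- last_ortg + cumsum(fc_var_final$fcst$ORtg[, "fcst"])
print(round(ortg_level_fc, 2))
```

Note that simply cumulating the lower and upper bounds the same way only approximates the level intervals; exact intervals would require the forecast error covariances across horizons.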

Granger Causality Tests:

Code
# Granger causality tests
var_names <- names(final_var$varresult)
tpar_name <- var_names[grep("3PAr|PAr", var_names, ignore.case = TRUE)]
if (length(tpar_name) == 0) tpar_name <- var_names[3]

granger_3par_ortg <- causality(final_var, cause = tpar_name)$Granger
granger_pace <- causality(final_var, cause = "Pace")$Granger
granger_ortg <- causality(final_var, cause = "ORtg")$Granger

cat("=== Granger Causality Tests ===\n\n")
=== Granger Causality Tests ===
Code
cat("3PAr → {ORtg, Pace}: F =", round(granger_3par_ortg$statistic, 3),
    ", p =", round(granger_3par_ortg$p.value, 4), "\n")
3PAr → {ORtg, Pace}: F = 2.158 , p = 0.1202 
Code
cat("Pace → {ORtg, 3PAr}: F =", round(granger_pace$statistic, 3),
    ", p =", round(granger_pace$p.value, 4), "\n")
Pace → {ORtg, 3PAr}: F = 0.717 , p = 0.4903 
Code
cat("ORtg → {Pace, 3PAr}: F =", round(granger_ortg$statistic, 3),
    ", p =", round(granger_ortg$p.value, 4), "\n")
ORtg → {Pace, 3PAr}: F = 0.122 , p = 0.8851 

Granger causality tests probe the temporal ordering of the analytics revolution by determining whether one variable's past values predict another's future changes: did the rise in three-point shooting precede changes in offensive efficiency and pace, or did successful offenses drive teams to adopt more three-pointers? In these annual league-average data, none of the tests reach significance (all p > 0.10), so they provide no evidence of a one-directional lead-lag relationship at the yearly level.
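The `causality()` calls above test each variable against the other two jointly. As a complementary, purely illustrative bivariate check (not part of the original analysis), `lmtest::grangertest` asks whether lagged 3PAr alone improves one-step prediction of ORtg:

```r
library(lmtest)

# Null hypothesis: lagged changes in 3PAr do NOT help predict changes in
# ORtg beyond ORtg's own lag. The 3PAr column name may appear as "3PAr"
# or "X3PAr" depending on how the ts object was built, hence the grep.
d_ortg <- var_data_final[, "ORtg"]
d_3par <- var_data_final[, grep("3PAr", colnames(var_data_final))[1]]
grangertest(d_ortg ~ d_3par, order = 1)
```

Bivariate tests can disagree with the joint block tests when the omitted third variable matters, so they are best read as a robustness check rather than a replacement.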

Code
var_names_irf <- names(final_var$varresult)
tpar_name_irf <- var_names_irf[grep("3PAr|PAr", var_names_irf, ignore.case = TRUE)]
if (length(tpar_name_irf) == 0) tpar_name_irf <- var_names_irf[3]

# IRFs
irf_3par_ortg <- irf(final_var, impulse = tpar_name_irf, response = "ORtg", n.ahead = 10)
plot(irf_3par_ortg, main = paste("Impulse:", tpar_name_irf, "→ Response: ORtg"))

Code
irf_pace_ortg <- irf(final_var, impulse = "Pace", response = "ORtg", n.ahead = 10)
plot(irf_pace_ortg, main = "Impulse: Pace → Response: ORtg")

Code
irf_ortg_3par <- irf(final_var, impulse = "ORtg", response = tpar_name_irf, n.ahead = 10)
plot(irf_ortg_3par, main = paste("Impulse: ORtg → Response:", tpar_name_irf))

The VAR model characterizes the joint dynamics of offensive efficiency, three-point shooting, and pace across NBA history. Granger causality tests quantify whether past values of one variable help predict future values of others, addressing the question of whether the rise in three-point shooting preceded changes in offensive efficiency or vice versa.

Impulse response functions trace how a one-time shock to one variable cascades through the system, showing whether innovations in one aspect of basketball strategy have lasting effects on others or whether defensive adjustments eventually neutralize advantages. Together, these tools suggest that although the analytics revolution transformed the NBA, no single variable unidirectionally drove the others: strategy and performance adjusted to each other, reflecting the complex adaptive nature of elite competition, where strategic innovations provoke counter-responses.
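A forecast error variance decomposition (FEVD) offers a third summary alongside Granger tests and IRFs: it apportions each variable's h-step forecast uncertainty among shocks to all three series. A sketch using the `fevd()` function from the same `vars` package that supplies `VAR()`:

```r
# Share of each variable's forecast error variance attributable to
# shocks in ORtg, Pace, and 3PAr at horizons 1-10.
fevd_var <- fevd(final_var, n.ahead = 10)
plot(fevd_var)
```

If most of each variable's variance is explained by its own shocks, the system's cross-effects are weak, which would be consistent with the non-significant Granger results above.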

References

1. Kubatko, J., Oliver, D., Pelton, K. & Rosenbaum, D. T. A starting point for analyzing basketball statistics. Journal of Quantitative Analysis in Sports 3 (2007).
2. Skinner, B. The problem of shot selection in basketball. PLoS ONE 7, e30776 (2012).
3. Franks, A., Miller, A., Bornn, L. & Goldsberry, K. Characterizing the spatial structure of defensive skill in professional basketball. (2015).
4. Sliz, B. A. An investigation of three-point shooting through an analysis of NBA player tracking data. arXiv preprint arXiv:1703.07030 (2017).
5. Poropudas, J. & Halme, T. Dean Oliver's four factors revisited. arXiv preprint arXiv:2305.13032 (2023).